Abstract

We consider phrase based Language Models (LM), which generalize the commonly used word level models. A similar concept of phrase based LMs appears in speech recognition, but it is rather specialized and thus less suitable for machine translation (MT). In contrast to the dependency LM, we first introduce the exhaustive phrase-based LMs tailored for MT use. Preliminary experimental results show that our approach outperforms word based LMs with respect to perplexity and translation quality.


1 Introduction 

Statistical language models, which estimate the distribution of various natural language phenomena, are crucial for many applications. In machine translation, the LM measures the fluency and well-formedness of a translation, and is therefore important for the translation quality; see (Och, 2002), (Koehn, Och and Marcu, 2003), etc.


Common applications of LMs include estimating the distribution based on N-gram coverage of words, to predict words and word order, as in (Stolcke, 2002) and (Lafferty et al., 2001). The independence assumption for each word is one of the widely adopted simplifying methods. However, it does not hold in textual data, and underlying content structures need to be investigated, as discussed in (Gao et al., 2004).

* This version of the paper was submitted for review to EMNLP 2013. The title, the idea and the content of this paper were presented by the first author in the machine translation group meeting at the MSRA-NLC lab (Microsoft Research Asia, Natural Language Computing) on July 16, 2013.

We model the prediction of phrases and phrase order. By considering all word sequences as phrases, the dependency inside a phrase is preserved, and the phrase level structure of a sentence can be learned from observations. This can be considered as an n-gram model on the n-grams of words; the word based LM is therefore a special case of the phrase based LM in which only single-word phrases are considered. Intuitively, our approach has the following advantages:

1) Long distance dependency: The phrase based LM can capture long distance relationships easily. To capture the sentence level dependency, e.g. between the first and last word of the sentence in Table 1, we need a 7-gram word based LM, but only a 3-gram phrase based LM, if we take “played basketball” and “the day before yesterday” as phrases.

2) Consistent translation unit with phrase based MT: Some words may acquire meaning only in context, such as “day” or “the” in “the day before yesterday” in Table 1. Considering frequent phrases as single units will reduce the entropy of the language model. More importantly, current MT is performed on phrases, which are taken as the translation unit. The translation task is to predict the next phrase, which corresponds to the phrase based LM.

Words          | John | played | basketball | the | day | before | yesterday
$w_1^I$        | $w_1$ | $w_2$ | $w_3$ | $w_4$ | $w_5$ | $w_6$ | $w_7$
Segmentations  | John | played basketball | the day before yesterday
$k_1^J$        | $k_1 = 1$ | $k_2 = 3$ | $k_3 = 7$
$p_1^J$        | $p_1 = w_1$ | $p_2 = w_2 w_3$ | $p_3 = w_4 w_5 w_6 w_7$
Re-ordered     | John | the day before yesterday | play basketball
Translation    |

Table 1: Phrase segmentation example.

3) Fewer independence assumptions in statistical models: The sentence probability is computed as the product of the single word probabilities in the word based n-gram LM and as the product of the phrase probabilities in the phrase based n-gram LM, given their histories. The fewer words/phrases a sentence contains, the fewer mistakes the LM may make due to independence assumptions on words/phrases. Once the phrase segmentation is fixed, the number of elements in the phrase based LM is much smaller than that in the word based LM. Therefore, our approach is less likely to incur errors due to these assumptions.

4) Phrase boundaries as additional information: We consider the different phrase segmentations of a sentence as a hidden variable, which provides additional constraints for aligning phrases in translation. Therefore, the constrained alignment of blocks of words can provide more information than the word based LM.
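The window sizes quoted in item 1 can be checked with a small sketch. The helper name `ngram_order_to_span` is hypothetical and only illustrates the counting argument; the segmentation is the one from Table 1.

```python
def ngram_order_to_span(units, first_word, last_word):
    """Smallest n such that a single n-gram over `units` covers both words."""
    start = min(i for i, u in enumerate(units) if first_word in u)
    end = max(i for i, u in enumerate(units) if last_word in u)
    return end - start + 1

sentence = "John played basketball the day before yesterday".split()

# Word level: every unit is a single word.
word_units = [(w,) for w in sentence]

# Phrase level: the segmentation shown in Table 1.
phrase_units = [
    ("John",),
    ("played", "basketball"),
    ("the", "day", "before", "yesterday"),
]

word_order = ngram_order_to_span(word_units, "John", "yesterday")
phrase_order = ngram_order_to_span(phrase_units, "John", "yesterday")
```

Under this segmentation the word level needs a window of seven units, while the phrase level needs only three.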


Comparison to Previous Work. In the dependency or structured LM, phrases corresponding to the grammar are considered, and dependencies are extracted, such as in (Gao et al., 2004) and in (Shen et al., 2008). However, in phrase based SMT, even phrases violating the grammar structure may help as a translation unit. For instance, the partial phrase “the day before” may appear both in “the day before yesterday” and “the day before Spring”. Most importantly, the phrase candidates in our phrase based LM are the same as those in phrase based translation, and are therefore more consistent in the whole translation process, as mentioned in item 2 in Section 1.

Some researchers have proposed phrase based LMs for speech recognition. In (Kuo and Reichl, 1999) and (Tang, 2002), new phrases are added to the lexicon with different measure functions. In (Heeman and Damnati, 1997), a different LM was proposed which derived the phrase probabilities from a language model built at the lexical level. Nonetheless, these methods do not consider the dependency between phrases and the re-ordering problem, and therefore are not suitable for the MT application.

2 Phrase Based LM 


We are given a sentence as a sequence of words $w_1^I = w_1 w_2 \cdots w_i \cdots w_I$ ($i \in 1, 2, \cdots, I$), where $I$ is the sentence length.

In the word based LM (Stolcke, 2002), the probability of a sentence $Pr(w_1^I)$¹ is defined as the product of the probabilities of each word given its previous $n - 1$ words:

$$P(w_1^I) = \prod_{i=1}^{I} P(w_i \mid w_{i-n+1}^{i-1}) \quad (1)$$
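As a minimal sketch of Equation 1, the following Python snippet multiplies conditional word probabilities under an n-gram history. The function name `word_lm_prob`, the table `cond_prob` and its values, and the floor probability for unseen events are illustrative assumptions, not part of our system.

```python
import math

def word_lm_prob(words, cond_prob, n, floor=1e-6):
    """Sentence probability as a product of P(w_i | previous n-1 words)."""
    log_p = 0.0
    for i, w in enumerate(words):
        # History: at most the n-1 words preceding position i.
        history = tuple(words[max(0, i - n + 1):i])
        # Unseen (history, word) pairs fall back to a small floor value.
        log_p += math.log(cond_prob.get((history, w), floor))
    return math.exp(log_p)

# Hypothetical bigram table: keys are (history, word) pairs.
cond_prob = {
    ((), "John"): 0.1,
    (("John",), "played"): 0.5,
    (("played",), "basketball"): 0.2,
}
p = word_lm_prob(["John", "played", "basketball"], cond_prob, n=2)
```

The sum is accumulated in log space only to avoid underflow on longer sentences; the result is the plain product of Equation 1.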


The positions of phrase boundaries on a word sequence $w_1^I$ are indicated by $k_0 = 0$ and $K = k_1^J = k_1 k_2 \cdots k_j \cdots k_J$ ($j \in 1, 2, \cdots, J$), where $k_j \in \{1, 2, \cdots, I\}$, $k_{j-1} < k_j$, $k_J = I$, and $J$ is the number of phrases in the sentence. We use $k_j$ to indicate that the $j$-th phrase segmentation boundary is placed after the word $w_{k_j}$ and in front of the word $w_{k_j+1}$, where $1 \le j \le J$. $k_0$ is a boundary on the left side of the first word $w_1$, which is defined as 0, and $k_J$ is always placed after the last word $w_I$ and therefore equals $I$.

An example is illustrated in Table 1. The English sentence ($w_1^I$) contains seven words ($I = 7$), where $w_1$ denotes “John”, etc. The first phrase segmentation boundary is placed after the first word, the second boundary is after the third word ($k_2 = 3$), and so on. The phrase sequence $p_1^J$ in this sentence has a different order than that in its translation, on the phrase level. Hence, the phrase based LM advances the word based LM in learning the phrase re-ordering.

¹ The notational convention will be as follows: we use the symbol $Pr$ to denote general probability distributions with (almost) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol $P(\cdot)$.

(1) Model description. Given a sequence of words $w_1^I$ and its phrase segmentation boundaries $k_1^J$, a sentence can also be represented in the form of a sequence of phrases $p_1^J = p_1 p_2 \cdots p_j \cdots p_J$ ($j \in 1, 2, \cdots, J$), and each individual phrase $p_j$ is defined as

$$p_j = w_{k_{j-1}+1} \cdots w_{k_j} = w_{k_{j-1}+1}^{k_j}.$$
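The mapping from boundaries to phrases can be sketched as follows. The helper `phrases_from_boundaries` is a hypothetical name; it takes the boundaries $k_1, \cdots, k_J$ with $k_0 = 0$ implied and is applied here to the example of Table 1.

```python
def phrases_from_boundaries(words, boundaries):
    """Split `words` into phrases; `boundaries` is k_1..k_J, with k_0 = 0 implied."""
    if not boundaries or boundaries[-1] != len(words):
        raise ValueError("the last boundary k_J must equal the sentence length I")
    phrases, prev = [], 0
    for k in boundaries:
        if k <= prev:
            raise ValueError("boundaries must be strictly increasing")
        # p_j = w_{k_{j-1}+1} ... w_{k_j} (1-based), i.e. words[prev:k] (0-based).
        phrases.append(tuple(words[prev:k]))
        prev = k
    return phrases

words = "John played basketball the day before yesterday".split()
phrases = phrases_from_boundaries(words, [1, 3, 7])
```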

In the phrase based LM, we consider the phrase segmentation $k_1^J$ as a hidden variable, and Equation 1 can be extended as follows:

$$Pr(w_1^I) = \sum_{K} Pr(w_1^I, K) = \sum_{k_1^J, J} Pr(p_1^J \mid k_1^J) \cdot Pr(k_1^J) \quad (2)$$

(2) Sentence probability. For the segmentation prior probability, we assume a uniform distribution for simplicity, i.e. $P(k_1^J) = 1/|K|$, where $|K|$ is the number of different segmentations $K$, i.e. $|K| = 2^{I-1}$ if the maximum phrase length or phrase n-gram length is not restricted. To compute $Pr(w_1^I)$, we consider either of two approaches:
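To make the size of the segmentation space concrete, the sketch below enumerates every boundary sequence for a short sentence. It assumes no limit on the phrase length; the generator name `segmentations` is hypothetical. Since $k_J$ is fixed at $I$, each of the $I - 1$ internal positions either carries a boundary or not, which gives $2^{I-1}$ candidates.

```python
from itertools import combinations

def segmentations(length):
    """Yield every boundary sequence k_1..k_J (with k_J = length)."""
    inner = range(1, length)  # positions where an internal boundary may be placed
    for r in range(length):   # r = number of internal boundaries, 0..length-1
        for cut in combinations(inner, r):
            yield list(cut) + [length]

all_k = list(segmentations(4))
uniform_prior = 1.0 / len(all_k)  # P(k_1^J) under the uniform assumption
```

For a four-word sentence this yields eight segmentations, from the single phrase `[4]` to the fully split `[1, 2, 3, 4]`.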

• Sum Model (Baum-Welch)

We consider all $2^{I-1}$ segmentation candidates. Equation 2 is defined as

$$Pr_{sum}(w_1^I) \approx \sum_{k_1^J, J} \prod_{j=1}^{J} P(p_j \mid p_{j-n+1}^{j-1}) \cdot P(k_1^J).$$

• Max Model (Viterbi)

The sentence probability formula of the second model is defined as

$$P_{max}(w_1^I) \approx \max_{k_1^J, J} \prod_{j=1}^{J} P(p_j \mid p_{j-n+1}^{j-1}) \cdot P(k_1^J).$$

In practice we select the segmentation that minimizes the perplexity of the sentence, rather than the one that maximizes its probability, in order to account for length normalization.
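The two models can be contrasted with a brute-force sketch that enumerates all segmentations of a short sentence. This is only an illustration under stated assumptions: the names `segmentations`, `sum_and_max` and `toy_prob` and the probability table are hypothetical, the prior is uniform, and a practical implementation would use dynamic programming rather than explicit enumeration.

```python
from itertools import combinations

def segmentations(words):
    """Yield each segmentation of `words` as a list of phrases (tuples of words)."""
    length = len(words)
    for r in range(length):
        for cut in combinations(range(1, length), r):
            bounds = [0, *cut, length]
            yield [tuple(words[a:b]) for a, b in zip(bounds, bounds[1:])]

def sum_and_max(words, phrase_prob, n=3):
    """Return (P_sum, P_max, best segmentation) under a uniform segmentation prior."""
    segs = list(segmentations(words))
    prior = 1.0 / len(segs)
    total, best_p, best_seg = 0.0, -1.0, None
    for seg in segs:
        p = prior
        for j, phrase in enumerate(seg):
            history = tuple(seg[max(0, j - n + 1):j])  # previous n-1 phrases
            p *= phrase_prob(phrase, history)
        total += p                 # Sum model: add up all segmentations
        if p > best_p:             # Max model: keep the most probable one
            best_p, best_seg = p, seg
    return total, best_p, best_seg

# Hypothetical phrase unigram table; the history is ignored in this toy model.
table = {("a",): 0.5, ("b",): 0.25, ("a", "b"): 0.25}

def toy_prob(phrase, history):
    return table.get(phrase, 0.0)

p_sum, p_max, best = sum_and_max(["a", "b"], toy_prob)
```

With two words there are two segmentations: the single phrase "a b" contributes 0.5 · 0.25 and the split "a | b" contributes 0.5 · 0.5 · 0.25, so the Sum model adds both terms while the Max model keeps only the larger one.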


(3) Perplexity. Sentence perplexity and text perplexity in the SUM model use the same definition as in the word based LM. Sentence perplexity in the MAX model is defined as

$$PPL(w_1^I) = \min_{k_1^J, J} \left[ P(w_1^I, k_1^J) \right]^{-1/J}.$$
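The effect of length normalization can be illustrated with a small sketch. It assumes that the perplexity of a segmentation is its joint probability raised to the power $-1/J$, with $J$ the number of phrases; the function name and the probabilities are hypothetical.

```python
import math

def max_model_perplexity(candidates):
    """candidates: iterable of (phrases, probability) pairs for one sentence.

    Returns the lowest phrase-normalized perplexity and the segmentation achieving it.
    """
    best_ppl, best_seg = math.inf, None
    for phrases, prob in candidates:
        if prob <= 0.0:
            continue  # a zero-probability segmentation has infinite perplexity
        ppl = prob ** (-1.0 / len(phrases))  # normalize by the number of phrases J
        if ppl < best_ppl:
            best_ppl, best_seg = ppl, phrases
    return best_ppl, best_seg

# Hypothetical joint probabilities P(w, k) for two segmentations of one sentence.
candidates = [
    ([("a", "b")], 0.125),        # J = 1, perplexity 8
    ([("a",), ("b",)], 0.0625),   # J = 2, perplexity 4
]
ppl, seg = max_model_perplexity(candidates)
```

In this toy case the single-phrase segmentation has the higher probability, but the two-phrase segmentation has the lower perplexity, so selecting by perplexity and selecting by probability can disagree.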

(4) Parameter estimation. We apply maximum likelihood to estimate probabilities in both the SUM model and the MAX model:

$$P(p_j \mid p_{j-n+1}^{j-1}) = \frac{C(p_{j-n+1}^{j})}{C(p_{j-n+1}^{j-1})} \quad (3)$$

where $C(\cdot)$ is the frequency of a phrase n-gram. The unigram phrase probability is $P(p) = C(p)/C$, where $C$ is the total frequency of all single phrases in the training text. Since the number of phrase sequences we generate grows exponentially with the sentence length, the number of parameters is huge. Therefore, we set the maximum n-gram length on the phrase level (note: not the phrase length) to $N = 3$ in the experiments.
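A minimal sketch of the relative-frequency estimate in Equation 3 is given below. The function name `mle_prob` and the counts are hypothetical; each key of the count table is a tuple of phrases, and each phrase is a tuple of words.

```python
from collections import Counter

def mle_prob(ngram, counts, total_unigrams):
    """Relative-frequency estimate of P(last phrase | preceding phrases)."""
    if len(ngram) == 1:
        # Unigram case: C(p) / C, with C the total count of single phrases.
        return counts[ngram] / total_unigrams if total_unigrams else 0.0
    history = ngram[:-1]
    return counts[ngram] / counts[history] if counts[history] else 0.0

# Hypothetical phrase n-gram counts.
john, played = ("John",), ("played", "basketball")
counts = Counter({(john,): 4, (played,): 2, (john, played): 1})
total = sum(c for gram, c in counts.items() if len(gram) == 1)

p_bigram = mle_prob((john, played), counts, total)   # 1 / 4
p_unigram = mle_prob((john,), counts, total)         # 4 / 6
```

Unseen histories return 0.0 here; smoothing is what assigns them non-zero mass.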


(5) Smoothing. For unseen events, we perform Good-Turing smoothing as commonly done in word based LMs. Moreover, we interpolate between the phrase probability and the product of the single word probabilities in a phrase using a convex combination:

$$\lambda P(p_j \mid p_{j-n+1}^{j-1}) + (1 - \lambda) \prod_{i=1}^{j'} P(w_i)$$

where phrase $p_j$ is made up of $j'$ words $w_1^{j'}$. The idea of this interpolation is to smooth the probability of a phrase consisting of $j'$ words with a $j'$-word unigram probability after normalization. In our experiments, we set $\lambda = 0.4$ for convenience.
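The interpolation can be sketched as follows. This is a simplified illustration: the function name `interpolated_prob` and the word probabilities are hypothetical, and the normalization step mentioned above is omitted.

```python
def interpolated_prob(phrase, phrase_cond_prob, word_prob, lam=0.4):
    """lam * P(phrase | history) + (1 - lam) * product of word unigram probabilities."""
    word_product = 1.0
    for w in phrase:
        word_product *= word_prob.get(w, 0.0)
    return lam * phrase_cond_prob + (1.0 - lam) * word_product

# Hypothetical values: an unseen phrase (conditional probability 0) still receives mass.
word_prob = {"the": 0.5, "day": 0.1}
p = interpolated_prob(("the", "day"), 0.0, word_prob)
```

Here the phrase n-gram was never observed, so the whole probability comes from the word unigram term: 0.6 · 0.5 · 0.1.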


(6) Algorithm of calculating phrase n-gram counts. The training task is to calculate the n-gram counts on the phrase level in Equation 3. Given a training corpus $W_1^S$ with $S$ sentences $W_s$ ($s = 1, 2, \cdots, S$), our goal is to compute $C(\cdot)$ for all phrase n-grams in which the number of phrases is no greater than $N$. Therefore, for each sentence $w_1^I$, we should find every phrase n-gram with $0 < n \le N$.




Data      | Sentences | Words  | Vocabulary
Training  | 54887     | 576778 | 23350
Dev2010   | 202       | 1887   | 636
Tst2010   | 247       | 2170   | 617
Tst2011   | 334       | 2916   | 765

Table 2: Statistics of corpora with sentence length no greater than 15 in training and 10 in test.


n | Base  | Sum  | Sum+S. | Max   | Max+S.
1 | 676.1 | 85.5 | 112.5  | 625.7 | 1129.4
2 | 180.8 | 52.6 | 72.1   | 161.1 | 306.2
3 | 162.3 | 52.5 | 72.2   | 140.4 | 266.5
4 | 162.5 | 52.6 | 72.3   | 141.1 | 267.6

Table 3: Perplexities on Tst2011 calculated based on various n-gram LMs with n = 1, 2, 3, 4.


Model  | Dev2010 | Tst2010 | Tst2011
Base   | 11.26   | 13.10   | 15.05
Word   | 11.92   | 12.93   | 14.76
Sum    | 11.86   | 12.77   | 14.80
Sum+S. | 12.02   | 12.54   | 14.76
Max    | 11.61   | 12.99   | 15.34
Max+S. | 11.56   | 13.55   | 15.27

Table 4: Translation performance on N-best list using different LMs in BLEU[%].


Base: but we need a success
Max:  but we need a way to success .
Ref:  we certainly need one to succeed .

Base: there is a specific steps that
Max:  there is a specific steps .
Ref:  there is step-by-step instructions on this .


We use dynamic programming to collect the phrase n-grams in one sentence $w_1^I$:

$$Q(1, d; w_1^I) = \{ p = w_b^d, \ \forall\, 1 \le b \le d \le I \}$$

$$Q(n, d; w_1^I) = \bigcup_b Q(n-1, b-1; w_1^I) \oplus p = w_b^d, \quad \forall\, n \le b \le d \le I,$$

where $Q(\cdot)$ is the auxiliary function denoting the multiset of all phrase n-grams or unigrams ending at position $d$ ($1 < n \le N$). $b$ denotes the starting word position of the last phrase in the multiset. $\{\cdot\}$ is a multiset, and $\oplus$ means to append the element to each element in the multiset. $\bigcup_b$ denotes the union of multisets. After appending $p$, we consider all $b$ that are no less than $n$ and no greater than $d$.

The phrase count $C(\cdot)$ is the sum over all phrase n-grams from all sentences $W_1^S$, with each sentence $W_s = w_1^I$, and $|\cdot|$ is the number of elements in a multiset:

$$C(p_1^n) = \sum_{s=1}^{S} \Big| p_1^n \in \bigcup_d Q(n, d; W_s) \Big|$$
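The recursion can be sketched in Python as follows. This is an illustrative reading of the definitions above rather than the implementation used in our experiments: the function name `phrase_ngram_counts` is hypothetical, memoization stands in for an explicit dynamic programming table, and the phrase length is not restricted.

```python
from collections import Counter
from functools import lru_cache

def phrase_ngram_counts(sentences, max_n):
    """Count every phrase n-gram (n <= max_n) over all segmentations of each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = tuple(sentence)

        @lru_cache(maxsize=None)
        def q(n, d):
            """All phrase n-grams whose last phrase ends at word position d (1-based)."""
            if n == 1:
                return tuple((words[b - 1:d],) for b in range(1, d + 1))
            grams = []
            for b in range(n, d + 1):            # b: start of the last phrase
                last = words[b - 1:d]
                for prefix in q(n - 1, b - 1):   # (n-1)-grams ending just before b
                    grams.append(prefix + (last,))
            return tuple(grams)

        for n in range(1, max_n + 1):
            for d in range(1, len(words) + 1):
                counts.update(q(n, d))
    return counts

counts = phrase_ngram_counts([["a", "b", "c"]], max_n=2)
```

For the three-word sentence this produces six phrase unigrams (every contiguous word span) and four phrase bigrams (every pair of adjacent spans).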

3 Experiments 

This is ongoing work. We performed preliminary experiments on the IWSLT (IWSLT, 2011) task, and evaluated the LM performance by measuring the LM perplexity and the MT translation performance.


Table 5: Examples of sentence outputs with baseline 
method and with the MAX model. 


Because of the computational requirements, we only employed sentences which contain no more than 15 words in the training corpus and no more than 10 words in the test corpora (Dev2010, Tst2010 and Tst2011), as shown in Table 2.

We took the word based LM in Equation 1 as the baseline method (Base). We calculated the perplexities on Tst2011 with different n-gram orders using both the SUM model and the MAX model, with and without smoothing (S.) as in Section 2. Table 3 shows that the perplexities of the SUM model (with and without smoothing) and of the unsmoothed MAX model are lower than those of the baseline for every n, whereas the smoothed MAX model yields higher perplexities than the baseline.


For MT, we selected the single best translation output based on the LM perplexity of the 100-best translation candidates, using different LMs as shown in Table 4. The MAX model with smoothing outperforms the baseline method on all three test sets, with BLEU score (Papineni et al., 2002) increases of 0.30% on Dev2010, 0.45% on Tst2010, and 0.22% on Tst2011, respectively.
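The selection step can be sketched as picking the candidate with the lowest LM perplexity. The helper name and the perplexity values below are made up for illustration; they do not come from our models.

```python
def select_by_perplexity(candidates, perplexity):
    """Pick the candidate translation whose LM perplexity is lowest."""
    return min(candidates, key=perplexity)

# Hypothetical two-entry N-best list with made-up perplexities standing in for a trained LM.
nbest = ["but we need a success", "but we need a way to success ."]
made_up_ppl = {nbest[0]: 210.0, nbest[1]: 150.0}

best = select_by_perplexity(nbest, made_up_ppl.__getitem__)
```

In practice the `perplexity` argument would be the sentence perplexity of the chosen LM, applied to each of the 100 candidates.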


Table 5 shows two examples from Tst2010, where we can see that our MAX model generates better selection results than the baseline method in these cases.




































4 Conclusion 

We presented preliminary results showing that a phrase based LM can improve the performance of MT systems and the LM perplexity. We presented two phrase based models which consider phrases as the basic components of a sentence and perform exhaustive search. Our future work will focus on efficiency for a larger data track as well as on improvements to the smoothing methods.